ProbeTools: designing hybridization probes for targeted genomic sequencing of diverse and hypervariable viral taxa

Background Sequencing viruses in many specimens is hindered by excessive background material from hosts, microbiota, and environmental organisms. Consequently, enrichment of target genomic material is necessary for practical high-throughput viral genome sequencing. Hybridization probes are widely used for enrichment in many fields, but their application to viral sequencing faces a major obstacle: it is difficult to design panels of probe oligo sequences that broadly target many viral taxa due to their rapid evolution, extensive diversity, and genetic hypervariability. To address this challenge, we created ProbeTools, a package of bioinformatic tools for generating effective viral capture panels, and for assessing coverage of target sequences by probe panel designs in silico. In this study, we validated ProbeTools by designing a panel of 3600 probes for subtyping the hypervariable haemagglutinin (HA) and neuraminidase (NA) genome segments of avian-origin influenza A viruses (AIVs). Using in silico assessment of AIV reference sequences and in vitro capture on egg-cultured viral isolates, we demonstrated effective performance by our custom AIV panel and ProbeTools’ suitability for challenging viral probe design applications. Results Based on ProbeTool’s in silico analysis, our panel provided broadly inclusive coverage of 14,772 HA and 11,967 NA reference sequences. For each reference sequence, we calculated the percentage of nucleotide positions covered by our panel in silico; 90% of HA and NA references sequences had at least 90.8 and 95.1% of their nucleotide positions covered respectively. We also observed effective in vitro capture on a representative collection of 23 egg-cultured AIVs that included isolates from wild birds, poultry, and humans and representatives from all HA and NA subtypes. Forty-two of forty-six HA and NA segments had over 98.3% of their nucleotide positions significantly enriched by our custom panel. These in vitro results were further used to validate ProbeTools’ in silico coverage assessment algorithm; 89.2% of in silico predictions were concordant with in vitro results. Conclusions ProbeTools generated an effective panel for subtyping AIVs that can be deployed for genomic surveillance, outbreak prevention, and pandemic preparedness. Effective probe design against hypervariable AIV targets also validated ProbeTools’ design and coverage assessment algorithms, demonstrating their suitability for other challenging viral capture applications. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08790-4.

Probe panels are typically designed by enumerating probe-length sub-sequences (k-mers) from reference sequences. The simplest approach to designing probes for hypervariable taxa is to enumerate k-mers from an exhaustive collection of reference sequences, thereby including as much genomic divergence in the design space as possible [7,8]. This approach becomes problematic, however, when redundant probe sequences are enumerated from repeated instances of conserved genomic loci.
A few strategies have been used to address this redundancy problem. One common strategy is to cluster similar k-mers after they have been enumerated [6,7]. Another strategy is to align candidate probe sequences against select reference genomes to identify and retain only those probes targeting divergent genotypes [8]. Redundancy has also been addressed by constraining the design space to a limited number of representative reference genomes, selected either by manual curation or clustering [9][10][11][12]. Some of these strategies have been supplemented with multiple sequence alignments over hypervariable loci or entire genomes so that probes are designed from consensus and degenerate sequences [9,10].
Spacing between probe sequences is another complicated design consideration. Regular spacing (tiling) is the most common approach because it is easy to implement, but it does not ensure optimal positioning of probes. Reducing the spacing increases the likelihood that some enumerated probes are optimally positioned, but it also increases the number of probe candidates and any associated computation to collapse redundancy among them. Creating the smallest possible panel of probes that optimally covers the entire target space quickly becomes an intractable computational problem, one that had led to increasingly complicated approaches including sophisticated minimization of loss functions [13].
Efforts to address viral hypervariability have resulted in several elaborate probe design algorithms. Unfortunately, these have largely been implemented on a study-by-study basis and have not resulted in generalpurpose software tools that can be easily used by others. Meanwhile, among the handful of published probe design packages, there is only one option that specifically addresses viral hypervariability [13]. The rest are intended for comparatively conserved eukaryotic genomes and are inadequate for many viral applications [14][15][16][17]. This leaves virologists with limited options for designing their own hybridization probes, especially if they have minimal capacity for custom programming, sophisticated mathematics, and experimental bioinformatics.
Here, we present ProbeTools, a general-purpose software package for designing compact probe panels against diverse viral taxa and other hypervariable genomic targets. It can also be used to assess how well existing panels cover user-provided target sequences. ProbeTools implements the established K-mer clustering method, but it adds a novel incremental design heuristic to minimize the generation of redundant probes. It also provides a simple command line user interface for ease of use and automation.
In this study, we demonstrate ProbeTools' effectiveness by designing capture panels for avian-origin influenza A viruses (AIVs). These viruses are subtyped by two hypervariable viral surface proteins called haemagglutinin (HA) and neuraminidase (NA), making them an appropriately challenging case study for ProbeTools. The genome segments encoding these proteins have diversified into 16 avian-origin HA subtypes and 9 avianorigin NA subtypes, giving rise to 144 possible combinations and the HxNx nomenclature used in both animal and human contexts (e.g. H1N1, H3N2, H5N1, H7N9). Furthermore, each of these subtypes has diverged into numerous clades, many of which have been only partially characterized [12,18,19].
In this study, we designed and validated a compact panel of 3600 probes for detecting and subtyping AIVs. Our results showed broad inclusivity against all avianorigin HA and NA subtypes based on in silico predictions against of tens-of-thousands of AIV reference sequences. We also demonstrated successful captures in vitro on a representative collection of 23 egg-cultured AIVs. These results validated the core ProbeTools algorithms and

Results
Assessing basic k-mer clustering and marginal improvements to target coverage with additional probes We began by assessing probe design against hypervariable targets with a basic k-mer clustering algorithm, wherein all 120-mers were enumerated from a target space of AIV reference sequences then clustered based on 90% nucleotide sequence identity. We used this strategy, implemented in the ProbeTools clusterkmers module, to generate probe panels of increasing size against 14,772 HA segment reference sequences and 11,967 NA segment reference sequences. We then used the ProbeTools capture module, which aligns probe sequences against target sequences, to assess target space coverage, i.e. the percentage of nucleotide positions in each target sequence covered by at least one probe in the panel (Fig. 1A, solid lines). As expected, panels with more probe sequences provided better target space coverage, however we observed diminishing marginal Fig. 1 Incremental design strategy improves upon basic k-mer clustering for probe panel design. Panels were designed against target spaces of 14,772 haemagglutinin (HA) and 11,967 neuraminidase (NA) genome segment reference sequences. The ProbeTools clusterkmers module was used to make panels using basic k-mer clustering and the makeprobes module was used to make panels with an incremental strategy. For each panel, probe coverage of reference sequences was assessed in silico using the ProbeTools capture module. A For both strategies, increasing panel size improved the 10th percentile of reference sequence coverage with diminishing marginal increases, but incrementally designed panels achieved more extensive coverage at larger panel sizes. Incrementally designed panels also provided better coverage of the worst-covered reference sequence using fewer probes. B Incrementally designed panels shifted coverage distributions upwards for the worst-covered reference sequences. Each reference sequence in the target space is represented as a dot, plotted according to its probe coverage. Coverage of the worst-covered reference sequence and 10th percentile of all reference sequences are indicated above the axis. C Incrementally designed panels improved reference sequence coverage by re-distributing probes from regions with deep coverage (4 or more probes) to regions with shallow coverage (2 or fewer probes) improvements for both HA and NA genome segments. We also noted that reference sequences with no probe coverage remained in the target space past the point of diminishing marginal returns. These results highlighted two limitations of the basic k-mer clustering approach: some HA and NA segments remained undetected despite designing additional probes, and additional probes provided only modest and diminishing improvements to the distribution of target coverage.

Improving target coverage with incremental panel design focused on poorly covered targets
To address the limitations we observed with basic k-mer clustering, we devised an incremental design strategy to improve marginal coverage increases, especially for poorly covered targets. In this strategy, basic k-mer clustering was used to design panels in smaller batches of 100 probes. After adding each batch to the growing panel, target space regions without probe coverage were identified using the capture module. These low coverage regions were then extracted with another ProbeTools module called getlowcov and used as a new target space for designing the next batch. In this way, each subsequent batch of probes was focused on regions not already covered by the panel.
We compared target space coverage for panels designed with this incremental strategy against panels designed above using basic k-mer clustering (Fig. 1). The incremental strategy provided higher 10th percentiles of coverage, especially for HA panels larger than 2000 probes and NA panels larger than 1200 probes (Fig. 1A). Furthermore, the incremental strategy provided better coverage for the worst-covered reference sequences (Fig. 1AB). We also compared depth of probe coverage, i.e. the number of probes covering each nucleotide position in target sequences (Fig. 1C). This comparison indicated that the incremental strategy improved target coverage by redistributing probes from positions with deep coverage to shallow coverage. We speculate that the incremental approach, by removing already-covered regions from the target space after each batch, limited the enumeration of adjacent, partially overlapping k-mers that provided redundant coverage. Based on the observed performance improvements of the incremental strategy, it was implemented as an additional self-contained ProbeTools module called makeprobes (Fig. 2).

Predicted coverage of HA and NA subtypes by AIV_v1 panel
Using the incremental strategy implemented in the ProbeTools makeprobes module, we generated an AIVtargeting probe panel called AIV_v1. It was composed of 1935 HA-specific probes and 1435 NA-specific probes.
We also included 184 probes targeting the highly conserved matrix segment (M) which is the standard AIV diagnostic target [24,34]. We then used the ProbeTools capture module to predict probe coverage using the AIV_v1 panel for all 36,313 AIV reference sequences in the target space. The minimum, maximum, and 10th percentile of reference sequence coverage was calculated for each HA and NA subtype and the M segment (Fig. 3A).
We observed that M segments had the best coverage followed by NA subtypes then HA subtypes, reflecting the comparative levels of genomic diversity within these genome segments. No reference sequence had less than 59.6% coverage, which is sufficient for segment and subtype identification. HA subtypes H5, H7, and H9 are considered high priority for AIV surveillance because they frequently cause agricultural outbreaks and novel influenza infections in humans [23][24][25][26]34]; 90% of H5, H7, and H9 reference sequences had at least 94.4, 88.5, and 92.4% probe coverage respectively. We also noted a significant positive monotonic association between a subtype's target coverage distribution and number of reference sequences from that subtype in the target space (Fig. 3B). This indicated that over-representing subtypes in the target space resulted in preferential design and better probe coverage for these targets, e.g. the high priority subtypes H5, H7, and H9.

In vitro capture of diverse egg-cultured influenza isolates
After assessing the AIV_v1 panel in silico, we had it synthesized and used it to perform in vitro captures on a collection of diverse egg-cultured AIV isolates (Table 1). We ensured that each avian-origin HA and NA subtype was represented in the collection, and we included isolates from wild birds, poultry, and humans. The collection contained 22 egg cultures, including one mixed infection, providing 23 viruses and 69 distinct HA, NA, and M segments for in vitro capture.
Sequencing libraries were prepared from each isolate then pooled. AIV library pools were diluted 1:100 (ng/ng) in libraries of background material made from mock-infected egg cultures, then captured three times independently using the AIV_v1 panel. Pre-and postcapture pools were sequenced to calculate mean foldenrichment at each nucleotide position in these 69 HA, NA, and M segments. Half of all nucleotide positions had a mean fold-enrichment greater than 351.2-fold, and 90% of nucleotide positions had a mean foldenrichment greater than 195.0-fold (Fig. 4A). We also calculated the percentage of the capture pools composed of background material from the mock-infected egg cultures, then compared these percentages pre-and post-capture (Fig. 4B). Before capture, the mean background percentage was 99.17%, but this was reduced to 0.03% following capture. Together, these data demonstrate effective enrichment of AIV material and removal of background by probe capture with the AIV_v1 panel.
We also used these in vitro results to assess breadth of enrichment, i.e. the percentage of nucleotide positions in each HA, NA, and M segment that had been significantly enriched by probe capture (Fig. 4C, Table S1). Breadth of enrichment was greater than 96.3% for 64 of 69 segments in the collection, and it was not less than 46.5% for any segment, which is sufficient for segment and subtype identification (Table S3). Nine isolates contained high priority H5, H7, and H9 segments, all of which had greater than 98.7% breadth of enrichment. This included two isolates from zoonotic human infections (H5N1 and H7N9), which were extensively enriched despite the absence of reference sequences from human infections in the target space used for probe design.
We further examined the five segments with less than 96.3% breadth of enrichment to understand why they were apparently not captured in full. First, we used the ProbeTools capture module to assess if the AIV_v1 panel lacked probes targeting their particular genome segment sequences. We observed that most positions without significant enriched were nonetheless extensively covered by the probe panel (Fig. 5A). This indicated that insufficient design by ProbeTools was not a major explanation for the lack of significant capture of these segments. The ProbeTools-designed AIV_v1 panel provided broadly inclusive coverage in silico of avian-origin HA subtypes, NA subtypes, and M segments. The AIV_v1 panel of 3600 probes was designed using the ProbeTools makeprobes module. It was composed of 1935 haemagglutinin (HA) segment-specific, 1435 neuraminidase (NA) segment-specific, and 184 matrix (M) segment-specific probes. A Coverage predictions against 36,313 reference sequences were generated with the ProbeTools capture module and stratified by subtype for HA and NA segments. The minimum, 10th percentile, and maximum of probe coverage against reference sequences from each subtype/segment are indicated. B A significant positive monotonic association was observed between the number of sequences from a subtype in the target space and that subtype's 10th percentile of coverage. Each dot represents an HA or NA subtype, and the results of Spearman's rank correlation test are indicated on the plots Next, we assessed whether experimental factors were responsible for nucleotide positions in these segments failing to achieve statistically significant enrichment. Fold-enrichment values between positions with and without significant enrichment were comparable, but variation between capture replicates were significantly different, with higher variation for positions that were not significantly enriched (Fig. 5 BC). We attribute this to sub-optimal cDNA synthesis for the affected positions, causing under-representation of these positions in the material that was captured, lower depths of coverage, and higher stochasticity (Fig. 5D). Despite this source of experimental variation, and the limited number of replicates that was practical for us to perform, only 3.1% of nucleotide positions across all HA, NA, and M segments were impacted, and most of these positions only barely failed the enrichment significance test (half achieved a p-value < 0.07) (Fig. 5E). Overall, our in vitro capture results demonstrated that the ProbeTools-designed AIV_v1 panel performed well on real viral isolates, effectively removing background material and providing high breadths of enrichment across HA, NA, and M segment targets.

Comparison of in silico probe coverage prediction and in vitro probe capture enrichment
ProbeTools relies on in silico coverage assessment by the capture module, both for final panel evaluation and for identifying poorly covered sequences during incremental design. To validate ProbeTools' coverage assessment algorithm, we examined how closely its in silico predictions agreed with in vitro capture results on egg-cultured AIV isolates.
Using the ProbeTools capture module, we determined which nucleotide positions in the egg-cultured AIVs were predicted to be covered by the AIV_v1 probe panel. We then compared these predictions to our in vitro capture results to see if significant enrichment had actually occurred at these nucleotide positions ( Fig. 6 and Fig. S1). Predicted probe coverage and significant enrichment results were concordant for 89.2% of nucleotide positions. Only 2.3% of nucleotide Table 1 Representative collection of egg-cultured avian influenza virus isolates. Isolates were selected to provide representation of each avian-origin haemagglutinin (HA) and neuraminidase (NA) subtype as well as infections from poultry, wild bird, and human hosts. Each specimen was given a name based on an abbreviation of its host type and a sequential number (P for poultry, WB for wild bird, and H for human). Poultry and wild bird isolates were obtained from the Canadian Food Inspection Agency's National Centre for Foreign Animal Disease (CFIA NCFAD), and human isolates were obtained from the Public Health Agency of Canada's National Microbiology Laboratory (PHAC NML). Isolate subtypes were confirmed in-house by genome sequencing positions targeted by the AIV_v1 panel were not significantly enriched. These were concentrated in the five segments discussed above that were impacted by variability between replicates (Fig. S2). We also noted that 7.7% of nucleotide positions were significantly enriched despite not being targeted by the AIV_v1 panel, a phenomenon that was observed in most segments across all isolates ( Fig. 6 and Fig. S1). We attribute this to the capture of larger fragments containing untargeted sequences adjacent to the location annealed by the probe. It might also indicate that local alignment parameters used by ProbeTools capture are more conservative than actual annealing thermodynamics. Either way, these results showed that ProbeTools predictions generally reflected actual capture of target genomic material, and in silico predictions more often underestimated panel performance when predictions were incorrect.

Discussion
This study highlighted some important considerations when designing panels using ProbeTools. Foremost among these was the effect of target space composition on panel inclusivity. In this AIV case study, we noted a significant positive monotonic association between panel coverage and the number of reference sequences representing a particular subtype in the target space. Based on how the ProbeTools algorithm ranks probe candidates by the number of k-mers in the cluster they represent, it stands to reason that over-representing similar taxa (which would contain many similar k-mers) would bias the resulting panel towards these taxa. Consequently, ProbeTools users should have a thorough knowledge of the contents of their target space and the possible sources of sampling bias in the databases from which they obtain their reference sequences. In the case of AIVs, the agricultural impacts and public health threats of Fig. 4 Effective in vitro capture of egg-cultured avian influenza virus isolates using the ProbeTools-designed AIV_v1 panel. The AIV_v1 panel of 3600 probes was designed using ProbeTools, and it was used to capture sequencing libraries made from a representative collection of 23 egg-cultured avian influenza viruses (AIVs) (described in Table 1). AIV libraries were pooled together, diluted 1:100 (ng/ng) in libraries of background material made from mock-infected egg cultures, then captured three times independently. A Pre-and post-capture pools were sequenced to calculate fold-enrichment at each nucleotide position in the haemagglutinin (HA), neuraminidase (NA), and matrix (M) genome segments of these isolates (mean of three independent replicates). B Background material from mock-infected egg cultures was effectively removed during probe capture. C Breadth of enrichment, i.e. the percentage of nucleotide positions that were significantly enriched by probe capture, was calculated for each HA, NA, and M genome segment in these isolates certain HA subtypes have led to more frequent sequencing of these subtypes and accessioning of their genome sequences in popular databases. For our panel, this contributed to bias towards subtypes like H5, H7 and H9. Whether this is a benefit or limitation will depend on the intended application. In the context of outbreak prevention and pandemic preparedness, a panel biased towards taxa that are known for their agricultural impact and zoonotic potential is beneficial. If the objective is to characterize viral diversity and ecology in wildlife, however, this could be a limitation.
To obtain the best results, ProbeTools users should purposefully curate their target space to serve their probe capture objectives. Users may want to identify taxa whose detection is a priority and over-represent them in the target space. Conversely, users may want Fig. 5 Lack of significant enrichment in segments with lower breadths of enrichment was due to experimental variation between capture replicates instead of insufficient probe design. A representative collection of 23 egg-cultured avian influenza viruses was captured three times independently using the ProbeTools-designed AIV_v1 panel. A ProbeTools capture was used to predict probe panel coverage of positions without significant enrichment from 5 genome segments with breadths of enrichment less than 96%. These positions were extensively targeted by probes in the AIV_v1 panel. B Fold-enrichment was comparable for positions with and without significant enrichment. The difference in distribution means was only 1.09-fold, although it was statistically significantly (p < 0.0001, Welsh's t-test) due to the large number of nucleotide positions involved in the comparison (n = 96,376 and n = 3082 for positions with and without significant enrichment respectively). C Variation in fold-enrichment between three independent replicates was significantly higher for positions that did not achieve significant enrichment (p < 0.0001, Levene's test). D Positions that failed to achieve statistically significant enrichment were significantly under-represented in the material that was captured (p < 0.0001, Welsh's t-test). E Most positions with insignificant enrichment narrowly failed the enrichment test's pre-determined alpha level of 5% to 'flatten' their target space to ensure no particular taxa, clades, subtypes, etc dominate. This could be done manually, by selecting specific sequences to represent relevant groups, or it could be attempted bioinformatically by pre-clustering target sequences, providing the number and length of target sequences do not make this computationally prohibitive.
Another strategy could be to use the various Probe-Tools modules to extract low coverage sequences from specific groups whose target sequences have poor probe coverage after a core panel is designed. For instance, had H15 subtype AIVs been a surveillance priority in this study, supplemental H15-specific probes could have been designed by running the capture, getlowcov, and makeprobes modules on the H15 subset of target sequences after noting their comparatively low coverage by the main panel. In this way, the modular nature of ProbeTools and the relatively simple-to-understand algorithms within each module empower users to experiment and find creative solutions. This flexibility is instrumental for tailoring probe panels to the needs of the user and their specific viral capture application.

Conclusions
In this study, we used ProbeTools to create an effective and broadly inclusive panel of hybridization capture probes for subtyping AIVs. Our results show the utility of this panel as a tool for AIV surveillance, outbreak prevention, and pandemic preparedness. They also demonstrate that ProbeTools can effectively design probes against hypervariable genomic targets like avian-origin HA and NA segments. This validation of ProbeTools' core design and coverage assessment algorithms shows that they are suitable for other challenging design applications, e.g. other viruses with hypervariable genes and pan-viral capture panels targeting multiple diverse taxa.
An increasing frequency of zoonotic outbreaks, epidemics, and pandemic crises has renewed interest in characterizing viral diversity at the interface of wildlife, livestock, game, and humans [35][36][37][38]. Genomic sequencing is becoming central to these One Health ventures. Viral capture panels will need designing and updating as our knowledge of viral threats continues to expand [39,40].
The on-going COVID-19 pandemic has also demonstrated the value of viral genomics to public health [41][42][43][44], resulting in unprecedented investments in sequencing capacity at public health laboratories. This will expand routine genomics for numerous common pathogens, requiring the development of new target enrichment protocols. The COVID-19 pandemic has popularized the use of tiled multiplex PCR for viral genome enrichment in clinical and public health applications [45,46], but on-going genomic drift is likely to cause amplicon dropouts and require frequent primer scheme redesigns for many pathogens, as has already been observed for SARS-CoV-2 [47]. Due to their longer length and, thus, higher tolerance of nucleotide mismatches [6], hybridization probe panels would require less frequent assay upkeep. To illustrate this principle, we used ProbeTools to design a SARS-CoV-2 panel containing 322 probes based on 1899 reference sequences from the first 2 months of the pandemic (January and February 2020). We then assessed in silico how well this panel covered 36,038 sequences from the most recent 2 months of the pandemic (May and June 2022); the tenth percentile of target coverage was 99.41% and the minimum was 98.19%, demonstrating that hybridization probes, especially panels designed by ProbeTools, can withstand genetic drift.
Furthermore, targeted enrichment protocols could be easily parallelized for multiple pathogens with probe capture; specimens containing different pathogens could be prepared into libraries concurrently and even pooled for a single capture using a pan-pathogen panel [8,9,11]. Amplicon sequencing, on the other hand, would require separately performed multiplex PCR reactions for each different pathogen, decreasing laboratory throughput.
Genomic sequencing is maturing into a routine tool for viral discovery, OneHealth surveillance, and clinical microbiology. Hybridization probe capture offers an Fig. 6 In silico predictions of probe coverage by ProbeTools were highly concordant with actual in vitro enrichment of egg-cultured AIV isolates. A representative collection of 23 egg-cultured avian influenza viruses was captured three times independently using the ProbeTools-designed AIV_v1 panel. Pre-and post-capture pools were sequenced to determine which nucleotide positions in the haemagglutinin (HA), neuramindase (NA), and matrix (M) genome segments of these isolates had been significantly enriched. The ProbeTools capture module was used to assess which nucleotide positions of these HA, NA, and M genome segments were targeted by the ProbeTools-designed panel. Each cell indicates the number of nucleotide positions meeting the corresponding in silico prediction and in vitro capture conditions enrichment method that is durable against genomic drift and conducive to high-throughput, parallelized workflows for numerous pathogens. ProbeTools facilitates probe design tasks for these endeavours.

ProbeTools modules
ProbeTools consists of five main modules written in Python (v3.7.3) that perform essential tasks in the probe design process. ProbeTools is freely available under the MIT License. It can be installed easily using the Anaconda/ Miniconda package and environment manager. Alternatively, it can be installed via the Python Package Index, followed by separate installation of its VSEARCH and BLASTn dependencies. Installation instructions, source code, documentation, and usage examples are available at https:// github. com/ Kevin Kuchi nski/ Probe Tools.
The clusterkmers module enumerates and clusters probe-length k-mers from user-provided target sequences. 1) K-mers are enumerated using a sliding window that advances by a specified number of bases. The user may also specify the width of the window. 2) K-mers are clustered based on nucleotide sequence similarity using VSEARCH cluster_fast [48]. 3) Centroid sequences from each cluster are ranked by the size of the cluster they represent. Centroids from larger clusters are assumed to be better probe candidates by virtue of having similarity to more k-mers in the target space. By default, clusterkmers enumerates 120-mers, advancing the window one base at a time, and it clusters using a nucleotide sequence identity threshold of 90%. Previous studies have observed effective hybridization between targets and probes with this degree of sequence similarity [9,11].
The capture module predicts how well user-provided probe sequences cover user-provided target sequences. 1) Each probe sequence is locally aligned against each target sequence using BLASTn [49]. 2) Alignments are filtered, retaining those with a minimum sequence identity over a minimum alignment length. 3) Subject alignment start and end coordinates are extracted from the BLASTn results to determine which nucleotide positions in the target sequences are covered by probes. By default, capture requires 90% sequence identity over at least 60 bases to assign probe coverage to the aligned positions.
The getlowcov module uses the output of capture to extract genomic regions with low coverage from the provided targets. This allows for additional probe design focused on poorly covered regions of the target space. This module returns all sub-sequences where a minimum number of consecutive bases were covered by fewer than a specified number of probes. By default, getlowcov returns all sub-sequences over 40 bases in length where all bases in the sub-sequence were covered by zero probes.
The stats module uses the output of capture to calculate coverage statistics. For each provided target, it calculates the percentage of nucleotide positions covered by varying numbers of probes ("target coverage" and "probe depth").
The makeprobes module chains the previous modules together to implement a generalized incremental design strategy (Fig. 2). In this strategy, probes are designed in batches, and regions of the target space with probe coverage are removed between batches so that additional probes are focused on poorly covered areas. This module can be used as a convenient departure point for custom designs. The user is only required to provide target sequences and select a batch size. They can optionally specify a maximum panel size and target space coverage goal. The makeprobes module iterates through its design loop, adding batches of probes to the panel until the maximum panel size is met, the target space coverage goal is achieved, or no further probes can be generated.

Preparation of AIV target space
All available full-length influenza A virus genome segment sequences from avian hosts were downloaded from the Influenza Research Database (www. fludb. org) on Dec 5, 2017 [50]. Sequences containing degenerate bases were removed to avoid low quality entries. Sequences were then clustered using VSEARCH cluster_fast (v1.0.7) [48] with a 100% sequence identity threshold to remove redundant entries. The remaining entries were used as our final AIV target space (described in Table 2).

AIV_v1 probe panel design and in silico coverage assessment
The AIV_v1 panel was designed against our final AIV target space using the ProbeTools makeprobes module as follows: 2000 probes were designed against HA targets in 20 batches of 100 probes; 1500 probes were designed against NA targets in 15 batches of 100 probes, and 200 probes were designed against M targets in 20 batches of 10 probes. All probes were 120 nucleotides in length, and designs were conducted using makeprobes with default parameters. Designs were conducted with ProbeTools v0.0.5, VSEARCH v1.0.7, and BLASTn v2.2.31.
The top-ranked 1935 HA probes, 1435 NA probes, and 184 M probes were combined into the final panel. Additional probes were added to the panel for potential control and validation applications, including 36 probes targeting the common reference strain A/Puerto Rico/8/34 and 10 probes targeting synthetic spikein DNA oligomers with randomly generated artificial sequences. This provided a final panel of 3600 probes (a breakpoint in the manufacturer's pricing structure), which was synthesized as a custom panel by Twist Bioscience (San Francisco, CA, USA). Sequences for probes in the AIV_v1 panel are provided in Supplemental Material 1. In silico coverage assessment of the AIV_v1 panel, both against the reference sequence target space and the consensus sequences of the egg-cultured isolate collection, were conducted using the capture and stats modules with default parameters.

Preparation of sequencing libraries from egg-cultured influenza isolates
Detailed laboratory procedures for the following are provided in Supplemental Material 2. RNA extracts from egg-cultured AIV isolates and mock infected eggs were provided by the Canadian Food Inspection Agency's National Centre for Foreign Animal Disease (Winnipeg, Manitoba, Canada) and the Public Health Agency of Canada's National Microbiology Laboratory (Winnipeg, Manitoba, Canada). Eggs were not directly handled by the authors. cDNA was prepared from each RNA extract using a previously described method [51]. cDNA was fragmented by sonication, then prepared into sequencing libraries for Illumina platforms with unique dual index barcodes. Adapter-ligated cDNA was split into three separate barcoding reactions, providing three separately barcoded replicate libraries for each isolate.

Probe capture enrichment and genomic sequencing of libraries prepared from egg-cultured influenza isolates
Detailed laboratory and bioinformatic procedures for the following are provided in Supplemental Material 2. 1) Three pools were prepared, with each pool containing one replicate library from each AIV isolate. These pools were sequenced in-house on Illumina MiSeq to generate full HA, NA, and M segment sequences for each isolate and to confirm HA and NA subtypes. 2) Each pool was diluted in 1:100 (ng/ng) in one of three replicate libraries of background genomic material that had been prepared from a mock-infected chicken egg. Aliquots of each diluted pool were sequenced pre-capture at Canada's Michael Smith Genome Sciences Centre (Vancouver, BC) on one Illumina HiSeq X lane to establish baseline HA, NA, and M segment abundance. 3) Each diluted pool was independently captured using the AIV_v1 probe panel. Captured pools were then sequenced in-house on Illumina MiSeq to assess target enrichment of HA, NA, and M segments post-capture.
Analysis of significant probe capture enrichment for egg-cultured AIV isolates 1) Pre-and post-capture depths of coverage were determined by mapping each library's sequencing reads to the HA, NA, and M segment sequences of its corresponding AIV isolate. 2) Depths of coverage were normalized by dividing raw pre-and post-capture read depths by the total reads in the corresponding pre-and post-capture pools (Table S2). 3) For each library, foldenrichment at each nucleotide position was calculated by dividing the normalized post-capture read depth by Table 2 Avian influenza virus reference sequences used as target space in this study. Full-length genome segment sequences from avian hosts were downloaded from the Influenza Research Database (www. fludb. org). Sequences containing degenerate bases were removed, then the remaining sequences were clustered using a 100% nucleotide sequence identity threshold to discard redundant entries. This provided a final target space of 36,313 reference sequences representing all avian-origin haemagglutinin (HA) subtypes, neuraminidase (NA) subtypes, and matrix (M) segments the normalized pre-capture read depth. 4) For each AIV isolate, mean fold-enrichment was calculated at every nucleotide position from the fold-enrichment values of its three independently captured replicate libraries. 5) Mean fold-enrichment values and their standard deviations were used to determine if significant enrichment had occurred at all nucleotide positions using a one-sample T-test against the fixed value of one-fold enrichment with an alpha level of 5%.